NOTE
Communicated by Heiga Zen
How to Pretend That Correlated Variables Are Independent by Using Difference Observations

Christopher K. I. Williams
[email protected]
School of Informatics, University of Edinburgh, Edinburgh EH1 2QL, U.K.
In many areas of data modeling, observations at different locations (e.g., time frames or pixel locations) are augmented by differences of nearby observations (e.g., δ features in speech recognition, Gabor jets in image analysis). These augmented observations are then often modeled as being independent. How can this make sense? We provide two interpretations, showing (1) that the likelihood of data generated from an autoregressive process can be computed in terms of “independent” augmented observations and (2) that the augmented observations can be given a coherent treatment in terms of the products of experts model (Hinton, 1999).
1 Introduction

In automatic speech recognition, it is often the case that hidden Markov models (HMMs) are used on observation vectors that are augmented by difference observations (so-called δ features; see Furui, 1986). Under the HMM, each observation vector is modeled as being conditionally independent given the hidden state. How can this make sense, as close-by differences are clearly not independent?

A similar difficulty arises in image analysis tasks such as texture segmentation (see, e.g., Dunn & Higgins, 1995). Here derivative features obtained from Gabor filters or wavelet analysis, for example, are modeled as being independent at different locations, despite the fact that these features will have been computed sharing some pixels in common.

In this article, we present two solutions to this problem. In section 2, we show that if the data are generated from a vector autoregressive (AR) model, then the likelihood can be expressed in terms of "independent" difference observations. In section 3, we show that the local models at each location can be combined using a product of experts model (Hinton, 1999) to provide a well-defined joint model for the data and that this can be related to AR models. Section 4 discusses how these interpretations are affected if the local models are conditional on a hidden state variable, as is the case for HMMs.
2 An AR Model

Consider a temporal vector autoregressive model,

X_t = \sum_{i=1}^{p} A_i X_{t-i} + N_t,   (2.1)

where the A_i's are square matrices and N_t is independent and identically distributed gaussian noise \sim N(0, \Sigma_N). X_t and N_t have dimension D for all t. To avoid complicated end effects, we will use periodic (wraparound) boundary conditions, so that the subscript t - i should be read mod(t - i, N). Thus, there are N random variables X_0, \ldots, X_{N-1}, which collectively we denote as X, and similarly for N. Then X and N are related by N = TX for an appropriate matrix T. Thus,

P(X) \propto \prod_{t=0}^{N-1} \exp\left( -\frac{1}{2} N_t^T \Sigma_N^{-1} N_t \right)   (2.2)
     = \prod_{t=0}^{N-1} \exp\left( -\frac{1}{2} \Big( \sum_{i=0}^{p} A_i X_{t-i} \Big)^T \Sigma_N^{-1} \Big( \sum_{i=0}^{p} A_i X_{t-i} \Big) \right),   (2.3)

where we have set A_0 = -I so that N_t = -\sum_{i=0}^{p} A_i X_{t-i}.

Now let Y_t^0, \ldots, Y_t^p be linearly independent linear combinations of X_t, \ldots, X_{t-p}. For example, we could choose Y_t^0 = X_t, Y_t^1 = X_t - X_{t-1}, and so on. As the Y_t^i's are simple linear combinations of X_t, \ldots, X_{t-p}, we have

\sum_{i=0}^{p} A_i X_{t-i} = \sum_{i=0}^{p} B_i Y_t^i,   (2.4)

for some set of matrices B_i. We can now write

P(X) \propto \prod_{t=0}^{N-1} \exp\left( -\frac{1}{2} \Big( \sum_{i=0}^{p} B_i Y_t^i \Big)^T \Sigma_N^{-1} \Big( \sum_{i=0}^{p} B_i Y_t^i \Big) \right),   (2.5)

showing that the likelihood of the underlying X process can be expressed in terms of a product of terms involving the difference observations up to order p at each time. Stacking Y_t^0, Y_t^1, \ldots, Y_t^p as the vector Y_t, we have

P(X) \propto \prod_{t=0}^{N-1} \exp\left( -\frac{1}{2} Y_t^T M Y_t \right),   (2.6)

where the (i, j) block of the matrix M (between Y_t^i and Y_t^j) has the form B_i^T \Sigma_N^{-1} B_j.
Equation 2.6 almost looks like a product of independent gaussians, but note that M is singular (it has rank D, as it arises from N_t), so the correct normalization factor of the gaussian cannot be obtained from it.

As a simple example, consider the scalar AR(1) process X_t = \alpha X_{t-1} + N_t and set Y_t^0 = X_t, Y_t^1 = X_t - X_{t-1}. Thus,

X_t - \alpha X_{t-1} = (1 - \alpha) X_t + \alpha (X_t - X_{t-1})   (2.7)
                     = (1 - \alpha) Y_t^0 + \alpha Y_t^1.   (2.8)

To obtain the likelihood for the sequence X, the matrix M will have the form

M = \frac{1}{\sigma_n^2} \begin{pmatrix} (1-\alpha)^2 & \alpha(1-\alpha) \\ \alpha(1-\alpha) & \alpha^2 \end{pmatrix},   (2.9)

where \sigma_n^2 = \mathrm{var}(N_t). As expected, M has rank 1 (it is an outer product). Interestingly, the matrix M is not equal to the inverse covariance of the Y_t's derived from the distribution for X. To show this, we first use the result that for the scalar AR(1) process on the circle, the covariance C[j] = \langle X_t X_{t-j} \rangle is given by

C[j] = \frac{\sigma_n^2 (\alpha^{|j|} + \alpha^{|N-j|})}{(1 - \alpha^2)(1 - \alpha^N)}.   (2.10)

Thus,

\mathrm{cov}(Y_t) = \begin{pmatrix} \langle Y_t^0 Y_t^0 \rangle & \langle Y_t^0 Y_t^1 \rangle \\ \langle Y_t^0 Y_t^1 \rangle & \langle Y_t^1 Y_t^1 \rangle \end{pmatrix} = \begin{pmatrix} C[0] & C[0] - C[1] \\ C[0] - C[1] & 2(C[0] - C[1]) \end{pmatrix}.   (2.11)

Inversion of cov(Y_t) shows that it is not equal to M as given in equation 2.9. Notice that the joint distribution of Y_0, \ldots, Y_{N-1} is singular.

If we take an AR process on the X variables, then one can choose linear combinations of the X_t's that are truly independent by carrying out an eigenanalysis. (For the periodic boundary conditions described above and time-invariant coefficients, the eigenbasis would be the Fourier basis.) However, if we allow ourselves an overcomplete basis set, then we have shown that the likelihood of X under the AR process can readily be computed using "independent" densities at each location. Although we have given the derivation above using gaussian noise, in fact the conclusion concerning expressing the likelihood of the X sequence in terms of a product of terms involving the Y_t's is independent of the form of the noise driving the AR process.

It is also possible to extend the AR model described above beyond the temporal one-dimensional chain. For example, Abend, Harley, and Kanal (1965) describe Markov mesh models in two dimensions. A simple example of such a model is a "third-order" Markov mesh, where X_{i,j} depends autoregressively on X_{i,j-1}, X_{i-1,j-1}, and X_{i-1,j}. The same construction in terms of Y variables can be used in this case.
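As a concrete check of the scalar AR(1) identity above, the following short NumPy sketch (not part of the original note; the values of N, α, and σ² are illustrative) verifies numerically that the circular AR(1) quadratic form equals the sum of the "independent" terms Y_t^T M Y_t, and that M has rank 1.

```python
# Numerical check of equations 2.6-2.9 for a scalar AR(1) process on a ring (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
N, alpha, s2 = 8, 0.7, 0.5
x = rng.normal(size=N)                               # any sequence X_0, ..., X_{N-1}

M = np.array([[(1 - alpha) ** 2, alpha * (1 - alpha)],
              [alpha * (1 - alpha), alpha ** 2]]) / s2
print(np.linalg.matrix_rank(M))                      # 1: M is an outer product

# Wraparound is handled by Python's x[-1] == x[N-1].
lhs = sum((x[t] - alpha * x[t - 1]) ** 2 for t in range(N)) / s2
Y = [np.array([x[t], x[t] - x[t - 1]]) for t in range(N)]   # Y_t = (X_t, X_t - X_{t-1})
rhs = sum(y @ M @ y for y in Y)
print(np.isclose(lhs, rhs))                          # True: the two quadratic forms agree
```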
3 Product of Experts Interpretation

At an individual location, we have a model P_t(Y_t) for the augmented vector Y_t. To define a joint distribution on X, we set

P(X) = \frac{1}{Z} \prod_t P_t(Y_t),   (3.1)

where Z is a normalization constant (known in statistical physics as the partition function). This is the product of experts construction (Hinton, 1999). One can also think of this as a Markov random field construction where P(X) \propto \exp(-E(X)) and E(X) = -\sum_t \log P_t(Y_t). If each P_t(Y_t) is gaussian, then P(X) will also be gaussian, and Z = (2\pi)^{N/2} |C|^{1/2}, where C is the covariance matrix of X.

Again we consider a simple example relating to a scalar AR(1) process, so Y_t = (X_t, X_t - X_{t-1})^T. Let

P_t(Y_t) \propto \exp\left( -\frac{1}{2} \{ a_0 X_t^2 + a_1 (X_t - X_{t-1})^2 \} \right),   (3.2)

with a_0, a_1 > 0. Then we obtain the joint distribution,

P(X) \propto \exp\left( -\frac{1}{2} \Big\{ a_0 \sum_t X_t^2 + a_1 \sum_t (X_t - X_{t-1})^2 \Big\} \right).   (3.3)

C^{-1}, the inverse covariance matrix of X, is circulant with entries a_0 + 2a_1 on the diagonal and -a_1 in the bands above and below the diagonal and in the northeast and southwest corners. For the AR(1) process, X_t = \alpha X_{t-1} + N_t with N_t \sim N(0, \beta^{-1}), we obtain corresponding entries of \beta(1 + \alpha^2) on the diagonal and -\beta\alpha off the diagonal. The overall scale of a_0 and a_1 has the same effect as \beta in setting the variance of the process, but r \stackrel{\mathrm{def}}{=} a_0/a_1 = (1-\alpha)^2/\alpha, so for any given \alpha value, there is a corresponding value of r.¹

For the gaussian case with expert t involving interactions between X_t and X_{t-p}, we obtain a quadratic form with the same pattern of banding as in the inverse covariance matrix of an AR(p) process, but as above, for some choices of parameters there may not be a corresponding AR process.

Again this construction can be extended to two (or more) dimensions. For example, in 2D we might consider the variable X_{i,j} and the differences to its four neighbors to the north, south, east, and west to obtain a five-dimensional Y vector. Equation 3.1, with each expert being gaussian, then defines a gaussian Markov random field over the lattice of X variables.
¹ Interestingly, for r \in (-4, 0) there are no corresponding values of \alpha. Note that \alpha = 0 \Rightarrow a_1 = 0.
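The correspondence between the gaussian product of experts and the circular AR(1) precision matrix can be checked numerically. The sketch below (not from the note; the parameter values are arbitrary) builds the circulant precision implied by equation 3.3 and compares it with \beta T^T T for the AR(1) process, using a_0 = \beta(1-\alpha)^2 and a_1 = \beta\alpha.

```python
# Product-of-experts precision versus circular AR(1) precision (illustrative values).
import numpy as np

N, alpha, beta = 6, 0.6, 2.0
a0, a1 = beta * (1 - alpha) ** 2, beta * alpha        # choice that matches the AR(1) process

# Precision of X implied by the experts exp(-0.5 [a0 X_t^2 + a1 (X_t - X_{t-1})^2]).
Q_poe = np.zeros((N, N))
for t in range(N):
    tm1 = (t - 1) % N
    Q_poe[t, t] += a0 + a1
    Q_poe[tm1, tm1] += a1
    Q_poe[t, tm1] -= a1
    Q_poe[tm1, t] -= a1

# Precision of the circular AR(1) process: beta * T^T T, where N = T X.
T = np.eye(N)
for t in range(N):
    T[t, (t - 1) % N] = -alpha
Q_ar = beta * T.T @ T

print(np.allclose(Q_poe, Q_ar))   # True: a0 + 2*a1 = beta*(1 + alpha^2) and -a1 = -beta*alpha
```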
4 Incorporating Hidden State

In speech recognition using HMMs, the Y_t's are modeled as conditionally independent given the discrete hidden variable s_t. We now consider how this affects the interpretations given above.

For interpretation 1, we consider a switching AR(p) process or AR-HMM (see, e.g., Woodland, 1992), so that X_t depends on X_{t-1}, \ldots, X_{t-p} and also s_t. For example, using gaussian noise and setting s_t = k, we have X_t \sim N(\sum_{i=1}^{p} A_i^k X_{t-i}, \Sigma_k). Notice that the AR model parameters now depend on the switching variable. However, we can still write the prediction \sum_{i=1}^{p} A_i^k X_{t-i} as a linear combination of the Y_t^i's, so the likelihood can be written in the form of "independent" contributions from the Y_t's. Note that the usual forward and backward HMM recursions can be carried out for the AR-HMM.

For interpretation 2, we have the individual component densities P_t(Y_t | s_t), and the joint distribution

P(X | s) = \frac{1}{Z(s)} \prod_t P_t(Y_t | s_t),   (4.1)

where s = (s_0, \ldots, s_{N-1}). Notice that the normalization constant in general depends on s, and thus when given X, the computation of P(X | s) depends not only on the component densities but also on Z(s). However, if P_t(Y_t | s_t) is gaussian and has the same covariance structure but different means depending on s_t for all t, then Z would turn out to be independent of s.

While writing this article, I became aware of the work of Tokuda, Zen, and Kitamura (2003), who correctly derive the product of gaussian experts construction conditional on s and note the general dependence of Z(s) on s. They also observe that use of the Viterbi algorithm to find the state sequence s that maximizes P(s) \prod_t P_t(Y_t | s_t) (which is easily done with standard dynamic programming techniques) will not, in general, yield the sequence that maximizes P(s | X), because of the Z(s) term.

Most practical HMM-based speech recognition systems use mixtures of gaussians to model the Y_t's at each frame. The product of experts interpretation readily handles this situation. For an AR model interpretation, the use of a mixture distribution for the Y_t's already suggests a switching AR process with the switching variable hidden.
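The dependence of Z(s) on the state sequence can be made concrete with a small numerical sketch (not from the note; the ring length, expert covariances, and state sequences below are arbitrary). With zero-mean gaussian experts on Y_t = (X_t, X_t - X_{t-1})^T, the product over t is an unnormalized gaussian in X, so Z(s) is a gaussian integral that can be evaluated exactly; two state sequences with different expert covariances give different values.

```python
# Z(s) for a product of gaussian experts on a ring (illustrative sketch).
import numpy as np

N = 6
covs = [np.array([[1.0, 0.3], [0.3, 1.0]]),           # state 0 expert covariance for Y_t
        np.array([[2.0, 0.0], [0.0, 0.5]])]           # state 1 expert covariance for Y_t

D = np.zeros((N, 2, N))                               # D[t] maps X to Y_t = (X_t, X_t - X_{t-1})
for t in range(N):
    D[t, 0, t] = 1.0
    D[t, 1, t] = 1.0
    D[t, 1, (t - 1) % N] = -1.0

def log_Z(s):
    """log Z(s) = log of the integral over X of prod_t N(Y_t(X); 0, C_{s_t})."""
    Q = np.zeros((N, N))                              # precision of the induced gaussian in X
    log_norm = 0.0                                    # sum of the experts' log normalizers
    for t, k in enumerate(s):
        Q += D[t].T @ np.linalg.inv(covs[k]) @ D[t]
        log_norm += -0.5 * np.log(np.linalg.det(2 * np.pi * covs[k]))
    _, logdetQ = np.linalg.slogdet(Q)
    # Gaussian integral: int exp(-0.5 x^T Q x) dx = (2 pi)^{N/2} |Q|^{-1/2}
    return log_norm + 0.5 * N * np.log(2 * np.pi) - 0.5 * logdetQ

print(log_Z([0] * N), log_Z([0, 1] * (N // 2)))       # different values: Z depends on s
```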
5 Discussion

We have described both conditionally specified models (AR processes) and simultaneously specified models (products of experts) to define the joint density² P(X) and relate it to the augmented feature vectors {Y_t}.

² This terminology is derived from Cressie (1993, sec. 6.3).
While this article describes a theoretical framework for understanding why using difference observations makes sense, it would be interesting to examine empirically the question of how well AR and products of experts models characterize the dependencies between time frames or pixel locations.

Acknowledgments

This note was inspired by questions raised by Joe Frankel's Ph.D. thesis. Thanks to John Bridle and Joe Frankel for helpful conversations and comments on earlier drafts, to Joe Frankel for drawing my attention to Tokuda et al. (2003), and to the anonymous referees for their comments, which helped to improve the article.

References

Abend, K., Harley, T. J., & Kanal, L. N. (1965). Classification of binary random patterns. IEEE Transactions on Information Theory, 11(4), 538–544.
Cressie, N. A. C. (1993). Statistics for spatial data. New York: Wiley.
Dunn, D., & Higgins, W. E. (1995). Optimal Gabor filters for texture segmentation. IEEE Transactions on Image Processing, 4(7), 947–964.
Furui, S. (1986). Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Transactions on Acoustics, Speech and Signal Processing, 34, 52–59.
Hinton, G. E. (1999). Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks (Vol. 1, pp. 1–6). London: IEE.
Tokuda, K., Zen, H., & Kitamura, T. (2003). Trajectory modelling based on HMMs with the explicit relationship between static and dynamic features. In Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech 2003). International Speech Communication Association.
Woodland, P. C. (1992). Hidden Markov models using vector linear prediction and discriminative output distributions. In Proceedings of the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 1, pp. 509–512). Piscataway, NJ: IEEE.

Received February 3, 2004; accepted May 28, 2004.
NOTE
Communicated by Thorsten Joachims
Convergence of the IRWLS Procedure to the Support Vector Machine Solution

Fernando Pérez-Cruz
[email protected]
Gatsby Computational Neuroscience Unit, 17 Queen Square, London WC1N 3AR, U.K., and Department of Signal Theory and Communications, University Carlos III in Madrid, Avda Universidad 30, 28911 Leganés (Madrid), Spain

Carlos Bousoño-Calzón
[email protected]

Antonio Artés-Rodríguez
[email protected]
Department of Signal Theory and Communications, University Carlos III in Madrid, Avda Universidad 30, 28911 Leganés (Madrid), Spain
An iterative reweighted least squares (IRWLS) procedure recently proposed is shown to converge to the support vector machine solution. The convergence to a stationary point is ensured by modifying the original IRWLS procedure.

1 Introduction

Support vector machines (SVMs) are state-of-the-art tools for linear and nonlinear input-output knowledge discovery (Vapnik, 1998; Schölkopf & Smola, 2001). The SVM relies on the minimization of a quadratic problem, which is frequently solved using quadratic programming (QP) (Burges, 1998). The iterative reweighted least square (IRWLS) procedure for solving SVM for classification was introduced in Pérez-Cruz, Navia-Vázquez, Rojo-Álvarez, and Artés-Rodríguez (1999) and Pérez-Cruz, Navia-Vázquez, Alarcón-Diana, and Artés-Rodríguez (2001), and it was used in Pérez-Cruz, Alarcón-Diana, Navia-Vázquez, and Artés-Rodríguez (2000) to construct the fastest SVM solver of the time. It solves a sequence of weighted least-square problems that, unlike other least-square procedures such as Lagrangian SVMs (Mangasarian & Musicant, 2000) or least square SVMs (Suykens & Vandewalle, 1999; Van Gestel et al., 2004), leads to the true SVM solution, as we will show here. However, to prove its convergence to the SVM solution, the IRWLS procedure has to be modified with respect to the formulation that appears in Pérez-Cruz et al. (1999, 2001). The IRWLS has been also proposed for solving regression problems (Pérez-Cruz, Navia-Vázquez, Alarcón-Diana, & Artés-Rodríguez, 2000).
Although we will deal only with the IRWLS for classification, the extension of this proof to regression is straightforward.

The article is organized as follows. We prove the convergence of the IRWLS procedure to the SVM solution in section 2 and summarize the algorithmic implementation of it in section 3. We conclude with some comments in section 4.

2 Proof of Convergence of the IRWLS Algorithm to the SVC Solution

The support vector classifier (SVC) seeks to compute the dependency between a set of patterns x_i \in R^d (i = 1, \ldots, n) and its corresponding labels y_i \in \{\pm 1\}, given a transformation to a feature space \phi(\cdot) (\phi: R^d \to R^H, with d \leq H). The SVC solves

\min_{w, \xi_i, b} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i(\phi^T(x_i) w + b) \geq 1 - \xi_i, \quad \xi_i \geq 0,

where w and b define the linear classifier in the feature space (nonlinear in the input space, unless \phi(x) = x) and C is the penalty applied over training errors. This problem is equivalent to the following unconstrained problem, in which we need to minimize

L_P(w, b) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} L(u_i)   (2.1)

with respect to w and b, where u_i = 1 - y_i(\phi^T(x_i) w + b) and L(u) = \max(u, 0). To prove the convergence of the algorithm, we need L_P(w, b) to be both continuous and differentiable; therefore, we replace L(u) by a smooth approximation,

L(u) = \begin{cases} 0, & u < 0 \\ K u^2 / 2, & 0 \leq u < 1/K \\ u - 1/(2K), & u \geq 1/K, \end{cases}

which tends to \max(u, 0) as K approaches infinity (\lim_{K \to \infty} L(u) = \max(u, 0)).
Being a convex problem, the SVM solution is achieved at w* and b*, which make the gradient vanish,

\nabla L_P(w^*, b^*) = \begin{pmatrix} \nabla_w L_P(w^*, b^*) \\ \nabla_b L_P(w^*, b^*) \end{pmatrix} = \begin{pmatrix} w^* - C \sum_{i=1}^{n} \phi(x_i) y_i \left. \frac{dL(u)}{du} \right|_{u_i^*} \\ -C \sum_{i=1}^{n} y_i \left. \frac{dL(u)}{du} \right|_{u_i^*} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},   (2.2)

where u_i^* = 1 - y_i(\phi^T(x_i) w^* + b^*).

Optimization problems are solved using iterative procedures that rely in each iteration on the previous solution (w^k and b^k, in our case) to obtain the following one, until the optimal solution has been reached. To construct the IRWLS procedure, we modify equation 2.1 using a first-order Taylor expansion of L(u) over the previous solution, as is common in other optimization procedures (Nocedal & Wright, 1999). This leads to

L_P'(w, b) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \left[ L(u_i^k) + \left. \frac{dL(u)}{du} \right|_{u_i^k} (u_i - u_i^k) \right],

where u_i^k = 1 - y_i(\phi^T(x_i) w^k + b^k); by construction, L_P'(w^k, b^k) = L_P(w^k, b^k) and \nabla L_P'(w^k, b^k) = \nabla L_P(w^k, b^k). Now, we construct a quadratic approximation L_P''(w, b) imposing that L_P''(w^k, b^k) = L_P'(w^k, b^k) and \nabla L_P''(w^k, b^k) = \nabla L_P'(w^k, b^k), leading to

L_P''(w, b) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \left[ L(u_i^k) + \left. \frac{dL(u)}{du} \right|_{u_i^k} \frac{(u_i)^2 - (u_i^k)^2}{2 u_i^k} \right]
            = \frac{1}{2} \|w\|^2 + \frac{1}{2} \sum_{i=1}^{n} a_i (1 - y_i(\phi^T(x_i) w + b))^2 + CT,   (2.3)

where

a_i = \frac{C}{u_i^k} \left. \frac{dL(u)}{du} \right|_{u_i^k} = \begin{cases} 0, & u_i^k < 0 \\ KC, & 0 \leq u_i^k < 1/K \\ C / u_i^k, & u_i^k \geq 1/K \end{cases}

and CT are constant terms that do not depend on w or b. The IRWLS procedure consists in minimizing equation 2.3 and then recomputing a_i with the obtained solution, and continuing until the solution has been reached. We will focus on the algorithmic implementation in the following section; meanwhile, we will demonstrate the following items to prove that the IRWLS procedure converges to the SVM solution:
• The sequence (w^0, b^0), \ldots, (w^k, b^k), \ldots converges to (w^{op}, b^{op}).
• w^{op} = w^* and b^{op} = b^*.

First, we need to prove that the sequence of solutions converges to a limiting point in solution space (w^{op}, b^{op}). Then we need to assess that this limit point corresponds with the SVM solution in equation 2.2.

Line search algorithms, for advancing toward the optimum, look in the functional being minimized for a descending direction p^k and modify the previous solution z^k by an amount \eta^k to obtain the following one, z^{k+1} = z^k + \eta^k p^k. The Wolfe conditions (Nocedal & Wright, 1999) ensure that line search methods make sufficient progress in each iteration, so the limit point is reached with the required precision:

L_P(z^k + \eta^k p^k) \leq L_P(z^k) + c_1 \nabla L_P(z^k)^T p^k   (2.4)
\nabla L_P(z^k + \eta^k p^k)^T p^k \geq c_2 \nabla L_P(z^k)^T p^k   (2.5)

for 0 < c_1 < c_2 < 1. The Wolfe conditions can be applied to the IRWLS procedure because we can describe it as a line search method, with z^k = [(w^k)^T \; b^k]^T and p^k = [(w^s - w^k)^T \; (b^s - b^k)]^T, where w^s and b^s represent the minimum of the weighted least-squares problem in equation 2.3.

To prove the first Wolfe condition, also known as the strictly decreasing property, we will first show that L_P(z^k) > L_P(z^k + \eta^k p^k) = L_P(z^{k+1}). We know that L_P(w^k, b^k) = L_P''(w^k, b^k) and, w^s and b^s being the minimum of equation 2.3, L_P''(w^k, b^k) \geq L_P''(w^s, b^s); equality will hold only if w^s = w^k and b^s = b^k, due to the fact that we are solving a least-squares problem. Consequently, L_P''(w^k, b^k) \geq L_P''(w^{k+1}, b^{k+1}) for all \eta^k \in (0, 1], because (w^{k+1}, b^{k+1}) is a convex combination of (w^k, b^k) and (w^s, b^s) and L_P''(w, b) is a convex functional, and equality will hold only if w^{k+1} = w^k = w^s and b^{k+1} = b^k = b^s.

Now we will set \eta^k to enforce that L_P''(w^{k+1}, b^{k+1}) \geq L_P(w^{k+1}, b^{k+1}), to guarantee that L_P(w^k, b^k) = L_P''(w^k, b^k) > L_P''(w^{k+1}, b^{k+1}) \geq L_P(w^{k+1}, b^{k+1}). To show that L_P''(w^{k+1}, b^{k+1}) \geq L_P(w^{k+1}, b^{k+1}), it is sufficient to prove that

L(u_i^k) + \left. \frac{dL(u)}{du} \right|_{u_i^k} \frac{(u_i^{k+1})^2 - (u_i^k)^2}{2 u_i^k} \geq L(u_i^{k+1}) \quad \forall i = 1, \ldots, n.

For u_i^k \geq 0, L(u_i^k) + \left. \frac{dL(u)}{du} \right|_{u_i^k} \frac{(u_i)^2 - (u_i^k)^2}{2 u_i^k} is tangent to L(u_i) at u_i = u_i^k, its minimum is attained at u_i = 0, and its minimal value is greater than or equal to zero. Therefore, in this case, L(u_i^k) + \left. \frac{dL(u)}{du} \right|_{u_i^k} \frac{(u_i)^2 - (u_i^k)^2}{2 u_i^k} \geq L(u_i) for any u_i \in R. We show an example for u_i^k = 1 in Figure 1.

For u_i^k < 0, we need to ensure that L(u_i^{k+1}) \leq 0, which can be obtained only for u_i^{k+1} \leq 0. As (w^{k+1}, b^{k+1}) is a convex combination of (w^k, b^k) and (w^s, b^s), u_i^{k+1} can be greater than zero only if u_i^s > 0. For the samples whose u_i^k < 0 and u_i^s > 0, we will need to set \eta_i^k \leq \frac{u_i^k}{u_i^k - u_i^s} to ensure that u_i^{k+1} \leq 0, and it can be easily checked that 0 < \eta_i^k < 1.
Figure 1: The dash-dotted line represents the actual SVM loss function L(u). The dashed line represents the approximation to the loss function used in equation 2.3 when u_i^k = 1, and the solid line represents this approximation when u_i^k < 0.
Then if we set

\eta^k = \min_{i \in S} \frac{u_i^k}{u_i^k - u_i^s},   (2.6)

where S = \{ i \,|\, u_i^k < 0 \;\&\; u_i^s > 0 \}, we will ensure that L_P''(w^{k+1}, b^{k+1}) \geq L_P(w^{k+1}, b^{k+1}). In the case S = \emptyset, we set w^{k+1} = w^s and b^{k+1} = b^s (i.e., \eta^k = 1), which proves that L_P(z^k + \eta^k p^k) < L_P(z^k). Now we can set c_1 \in (0, c_1^*] to fulfill equation 2.4, where

c_1^* = \frac{L_P(z^k + \eta^k p^k) - L_P(z^k)}{\nabla L_P(z^k)^T p^k}

is greater than zero because \nabla L_P(z^k)^T p^k < 0; otherwise, p^k would not be a descending direction.

Before proving the second Wolfe condition for the IRWLS, let us rewrite the first-order approximation L_P'(w, b) as follows:

L_P'(w, b) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \left[ L(u_i^k) + \left. \frac{dL(u)}{du} \right|_{u_i^k} y_i [\phi^T(x_i)(w^k - w) + (b^k - b)] \right],

and let us define

\tilde{L}_P'(w, b) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \left[ L(u_i^{k+1}) + \left. \frac{dL(u)}{du} \right|_{u_i^{k+1}} y_i [\phi^T(x_i)(w^{k+1} - w) + (b^{k+1} - b)] \right],
which is equivalent to L_P'(w, b) but defined over the actual solution instead. L_P(w, b) being convex, it can be readily seen that L_P(w, b) \geq L_P'(w, b) and L_P(w, b) \geq \tilde{L}_P'(w, b) for all w \in R^H and b \in R. As (w^{k+1}, b^{k+1}) is a convex combination of (w^k, b^k) and (w^s, b^s), we can rewrite p^k = [(w^s - w^k)^T \; (b^s - b^k)]^T = [(w^{k+1} - w^k)^T \; (b^{k+1} - b^k)]^T / \eta^k, leading in the left-hand side of equation 2.5 to

\eta^k (\nabla L_P(z^{k+1})^T p^k)
  = (w^{k+1})^T (w^{k+1} - w^k) - C \sum_{i=1}^{n} \left. \frac{dL(u)}{du} \right|_{u_i^{k+1}} y_i [\phi^T(x_i)(w^{k+1} - w^k) + (b^{k+1} - b^k)]
  = \|w^{k+1}\|^2 - (w^{k+1})^T w^k + \frac{1}{2} \|w^k\|^2 + C \sum_{i=1}^{n} L(u_i^{k+1}) - \tilde{L}_P'(w^k, b^k)
  = \frac{1}{2} \|w^{k+1} - w^k\|^2 + L_P(w^{k+1}, b^{k+1}) - \tilde{L}_P'(w^k, b^k).

We now repeat the same algebraic transformations over the right-hand side of equation 2.5, leading to
\eta^k (\nabla L_P(z^k)^T p^k)
  = (w^k)^T (w^{k+1} - w^k) - C \sum_{i=1}^{n} \left. \frac{dL(u)}{du} \right|_{u_i^k} y_i [\phi^T(x_i)(w^{k+1} - w^k) + (b^{k+1} - b^k)]
  = (w^{k+1})^T w^k - \|w^k\|^2 - \frac{1}{2} \|w^{k+1}\|^2 - C \sum_{i=1}^{n} L(u_i^k) + L_P'(w^{k+1}, b^{k+1})
  = -\frac{1}{2} \|w^{k+1} - w^k\|^2 - L_P(w^k, b^k) + L_P'(w^{k+1}, b^{k+1}).

We now show that

\frac{\frac{1}{2}\|w^{k+1} - w^k\|^2 + L_P(w^{k+1}, b^{k+1}) - \tilde{L}_P'(w^k, b^k)}{\eta^k} > \frac{-\frac{1}{2}\|w^{k+1} - w^k\|^2 - L_P(w^k, b^k) + L_P'(w^{k+1}, b^{k+1})}{\eta^k},

which is equivalent to

\|w^{k+1} - w^k\|^2 + [L_P(w^{k+1}, b^{k+1}) - L_P'(w^{k+1}, b^{k+1})] + [L_P(w^k, b^k) - \tilde{L}_P'(w^k, b^k)] > 0,

because \eta^k \in (0, 1]. The terms L_P(w^{k+1}, b^{k+1}) - L_P'(w^{k+1}, b^{k+1}) and L_P(w^k, b^k) - \tilde{L}_P'(w^k, b^k) are equal to or greater than zero because the loss function is convex. Moreover, \|w^{k+1} - w^k\|^2 \geq 0, and it is zero only if w^{k+1} = w^k. Therefore, if we are not at the solution, \nabla L_P(z^{k+1})^T p^k > \nabla L_P(z^k)^T p^k. Now, we can set c_2 \in [c_2^*, 1) to fulfill equation 2.5,¹ where c_2^* = \frac{\nabla L_P(z^{k+1})^T p^k}{\nabla L_P(z^k)^T p^k} is less than one because \nabla L_P(z^k)^T p^k < 0; otherwise, p^k would not be a descending direction.

We now need to prove that the proposed algorithm stops when the gradient of L_P(w, b) vanishes. The Zoutendijk condition (Nocedal & Wright, 1999) tells us that if L_P(w, b) is bounded below and is Lipschitz continuous,² and the optimization procedure fulfills the Wolfe conditions, then \|\nabla L_P(w^k, b^k)\|^2 \cos^2 \theta_k \to 0 as k \to \infty, where \cos \theta_k = -\frac{\nabla L_P(w^k, b^k)^T p^k}{\|\nabla L_P(w^k, b^k)\| \, \|p^k\|}. If we prove that \theta_k does not tend to \pi/2 as k \to \infty, we would have proven that the gradient of L_P(w, b) vanishes and that the proposed algorithm converges to a minimum.

¹ If c_2^* < 0, the minimum value c_2 can take is c_1, and if c_1^* > 1, the highest value c_1 can take is c_2, but this does not affect the given proof.
² L_P(w, b) is equal to or greater than zero, and it is Lipschitz continuous, because we made it differentiable.

Finally, we would need to prove that the achieved solution corresponds to the SVM solution, which we will first prove. The minimum of equation 2.3 is obtained by solving the following linear system:

\begin{pmatrix} w - \sum_{i=1}^{n} \phi(x_i) y_i a_i (1 - y_i(\phi^T(x_i) w + b)) \\ -\sum_{i=1}^{n} y_i a_i (1 - y_i(\phi^T(x_i) w + b)) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.   (2.7)
The IRWLS procedure stops when w^s = w^k and b^s = b^k; if we replace them in equation 2.7, we are led to

\begin{pmatrix} w^s - \sum_{i=1}^{n} \phi(x_i) y_i \frac{C}{u_i^k} \left. \frac{dL(u)}{du} \right|_{u_i^k} (1 - y_i(\phi^T(x_i) w^s + b^s)) \\ -\sum_{i=1}^{n} y_i \frac{C}{u_i^k} \left. \frac{dL(u)}{du} \right|_{u_i^k} (1 - y_i(\phi^T(x_i) w^s + b^s)) \end{pmatrix} = \begin{pmatrix} w^s - C \sum_{i=1}^{n} \phi(x_i) y_i \left. \frac{dL(u)}{du} \right|_{u_i^s} \\ -C \sum_{i=1}^{n} y_i \left. \frac{dL(u)}{du} \right|_{u_i^s} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},   (2.8)

which is equal to equation 2.2. Consequently, the IRWLS algorithm stops when it has reached the SVM solution.

To prove the sufficient condition, we need to show that if w^k = w^* and b^k = b^*, the IRWLS has stopped. Suppose it has not; then we can find w^s \neq w^k and b^s \neq b^k such that L_P(w^k, b^k) > L_P(w^s, b^s), and the strictly decreasing property will lead to L_P(w^*, b^*) > L_P(w^s, b^s), which is a contradiction because w^* and b^* give the minimum of L_P(w, b). We have just proven that if the IRWLS has stopped, we will be at the SVM solution, and if we are at the SVM solution, the IRWLS has stopped.

Finally, we do not need to prove that \theta_k does not tend to \pi/2, because we have just shown that the algorithm stops iff we are at the SVM solution and, consequently, this is the point at which the gradient of L_P(w, b) vanishes. That was what we still needed to prove, ending the proof of convergence.

3 Iterative Reweighted Least Squares for Support Vector Classifiers

The IRWLS procedure, when introduced in Pérez-Cruz et al. (1999), did not consider the modification to ensure convergence presented in the previous section (i.e., \eta^k < 1 in some iterations). We will now describe the algorithmic implementation of the procedure. But before presenting the algorithm, let us rewrite equation 2.7 in matrix form,

\begin{pmatrix} \Phi^T D_a \Phi + I & \Phi^T a \\ a^T \Phi & a^T 1 \end{pmatrix} \begin{pmatrix} w \\ b \end{pmatrix} = \begin{pmatrix} \Phi^T D_a y \\ a^T y \end{pmatrix},   (3.1)

where \Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]^T, y = [y_1, \ldots, y_n]^T, a = [a_1, \ldots, a_n]^T, (D_a)_{ij} = a_i \delta_{ij} (\forall i, j = 1, \ldots, n), I is the identity matrix, and 1 is a column vector of n ones. This system can be solved using kernels, as well as the regular SVM, by imposing that w = \sum_i \phi(x_i) y_i \alpha_i and \sum_i \alpha_i y_i = 0. These conditions can be obtained from the regular SVM solution (KKT conditions; see Schölkopf & Smola, 2001, for further details).
Also, they can be derived from equation 2.2, in which the \alpha_i have replaced the derivative of L(u_i). The system in equation 3.1 becomes

\begin{pmatrix} H + D_a^{-1} & 1 \\ y^T & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ b \end{pmatrix} = \begin{pmatrix} y \\ 0 \end{pmatrix},   (3.2)

where (H)_{ij} = k(x_i, x_j) = \phi^T(x_i)\phi(x_j) and k(\cdot, \cdot) is the kernel of the nonlinear transformation \phi(\cdot) (Schölkopf & Smola, 2001). The steps to derive equation 3.2 from equation 3.1 can be found in Pérez-Cruz et al. (2001). The IRWLS can be summarized in the following steps:

1. Initialization: set k = 0, \alpha^0 = 0, b^0 = 0, and u_i^0 = 1 \; \forall i = 1, \ldots, n.
2. Solve equation 3.2 to obtain \alpha^s and b^s.
3. Compute u_i^s. Construct S = \{ i \,|\, u_i^k < 0 \;\&\; u_i^s > 0 \}. If S = \emptyset, set \alpha^{k+1} = \alpha^s and b^{k+1} = b^s and go to step 5.
4. Compute \eta^k, using equation 2.6, and \alpha^{k+1} and b^{k+1}. If L(\alpha^{k+1}, b^{k+1}) > L(\alpha^s, b^s), set \alpha^{k+1} = \alpha^s and b^{k+1} = b^s.
5. Set k = k + 1 and go to step 2 until convergence.

The modification in the third step helps to further decrease the SVM functional. The value of \eta^k in equation 2.6 is a sufficient condition but not a necessary one, and in some cases, \alpha^s and b^s can produce a further decrease in L(\alpha, b) than using \alpha^{k+1} and b^{k+1}. It is worth pointing out that the solution achieved in the first step coincides with the least-square support vector machine solution (Suykens, Van Gestel, De Brabanter, De Moor, & Vandewalle, 2003), as D_a is the identity matrix multiplied by C. In a way, we can say that the starting point of the IRWLS procedure is the LS-SVM solution.
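To make the loop concrete, here is a minimal NumPy sketch of the steps above for the special case of a linear classifier, \phi(x) = x, working directly with the weighted least-squares problem of equation 2.3 in (w, b) rather than with the kernel system of equation 3.2. It is an illustrative reading of the procedure, not the authors' MATLAB implementation; the toy data, C, and K are arbitrary.

```python
# IRWLS for a linear SVC (illustrative sketch of steps 1-5 above).
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)), rng.normal(1.0, 1.0, (20, 2))])
y = np.r_[-np.ones(20), np.ones(20)]
n, d = X.shape
C, K = 10.0, 1e6

def L_smooth(u):                         # smoothed loss L(u)
    return np.where(u < 0, 0.0, np.where(u < 1 / K, 0.5 * K * u ** 2, u - 0.5 / K))

def LP(w, b):                            # primal functional, equation 2.1
    return 0.5 * w @ w + C * np.sum(L_smooth(1 - y * (X @ w + b)))

def weights(u):                          # a_i of equation 2.3
    a = np.where(u < 1 / K, K * C, C / np.maximum(u, 1 / K))
    return np.where(u < 0, 0.0, a)

w, b = np.zeros(d), 0.0                  # step 1 (u_i^0 = 1, so all a_i = C)
for k in range(100):
    u = 1 - y * (X @ w + b)
    a = weights(u)
    # Step 2: minimize 0.5||w||^2 + 0.5 sum_i a_i (1 - y_i(x_i^T w + b))^2 over (w, b).
    G = np.hstack([X * y[:, None], y[:, None]])          # rows (y_i x_i, y_i)
    P = np.diag(np.r_[np.ones(d), 1e-12])                # identity on w; tiny ridge on b for safety
    theta = np.linalg.solve(G.T @ (a[:, None] * G) + P, G.T @ a)
    ws, bs = theta[:d], theta[d]
    # Steps 3-4: line search factor eta^k of equation 2.6, then keep the better point.
    us = 1 - y * (X @ ws + bs)
    S = (u < 0) & (us > 0)
    eta = np.min(u[S] / (u[S] - us[S])) if S.any() else 1.0
    w_new, b_new = w + eta * (ws - w), b + eta * (bs - b)
    if LP(w_new, b_new) > LP(ws, bs):
        w_new, b_new = ws, bs
    if np.allclose(np.r_[w_new, b_new], np.r_[w, b], atol=1e-10):
        break                                            # step 5: stop at convergence
    w, b = w_new, b_new

print(k, LP(w, b), w, b)
```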
intermediate iteration uki < 0, the algorithm will recover it. The IRWLS changes several constraints from active to inactive in one iteration, while most QP algorithms stop when a constraint changes. We illustrate this property with a simple example in Figure 2. Third, we point out that the value of ηk is 1 in most iterations; only seldom does a sample change from uki < 0 to usi > 0 and L(wk+1 , bk+1 ) ≤ L(ws , bs ), which explains why the IRWLS was working correctly without this modification. The role of K can be analyzed from the proof and implementation perspectives. A finite value of K allows demonstrating that the IRWLS procedure converges to the SVM solution. If K was infinite, the functional would not be differentiable, and although equation 2.2 and 2.8 will be equal, that would not mean that wop (bop ) is equal to w∗ (b∗ ). From the implementation viewpoint, we are adding at least 1/K to the diagonal of H; therefore, if H is nearly singular (as it is for most problems of interest—even for infinite VC dimension kernels), a finite value of K would avoid numerical instabilities. We usually fix K between 104 and 1010 , depending on the machine precision,
Figure 2: (a–e) The intermediate solution of the IRWLS procedure for a simple problem, respectively, for iterations 1, 2, 3, 8, and 18 (final). (f) The value of L_P for every iteration. The squares represent the negative-class samples and the circles the positive-class samples. The solid circles and squares are those samples whose u_i^{k+1} > 0. The solid line is the classification boundary, and the dashed lines represent the ±1 margins. It can be seen that in the first step, almost half of the samples change from u_i^0 > 0 to u_i^1 < 0, significantly advancing toward the optimum and reducing the complexity of subsequent iterations. The solution in iteration 8 is almost equal to the solution in iteration 18; in these intermediate iterations, the algorithm is only fine-tuning the values of \alpha_i.
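The numerical role of K discussed above can be seen in a two-line experiment. The sketch below (not from the paper; the RBF kernel, data, and K = 10^6 are arbitrary choices) compares the condition number of a nearly singular kernel matrix H with that of H + I/K.

```python
# Effect of adding 1/K to the diagonal of a nearly singular kernel matrix (illustrative).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
H = np.exp(-sq / 2.0)                        # RBF kernel matrix, close to singular
K = 1e6
print(np.linalg.cond(H))                     # very large condition number
print(np.linalg.cond(H + np.eye(200) / K))   # much better conditioned
```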
A software package in MATLAB can be downloaded from our web page (http://www.gatsby.ucl.ac.uk/∼fernando).

4.1 Extending the IRWLS Procedure. We have proven the convergence of the IRWLS procedure to the standard SVM solution. This procedure is sufficiently general to be applied to other loss functions. It can be directly applied to any convex, continuous, and differentiable loss function. As we rely on the derivative of the loss to demonstrate that the limiting point of the IRWLS procedure is the solution to the actual functional, we need the first-order Taylor expansion to be a lower bound on the loss function to show the sufficient decreasing property. To complete the IRWLS procedure, one needs to come up with a quadratic approximation, which has to be at least locally an upper bound to the loss function to ensure the strictly decreasing property. This upper bound has to take the same value and derivative that the loss function does at the actual solution to ensure that the sequence of solutions converges to the solution of our functional.

The question that readily arises is whether it can be employed for nonconvex loss functions. First, if it is used at all, it would lead only to a local minimum; another method would be needed to assess the quality of the obtained solution. If the loss function is not convex, the sufficient decreasing property would not hold for every possible solution found by the second step of the IRWLS procedure, and further analysis would be necessary, taking into account the shape of the nonconvex function, to demonstrate the convergence of the algorithm.

Acknowledgments

We kindly thank Chih-Jen Lin for his useful comments and for pointing out the weakness of our previous proofs.

References

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Knowledge Discovery and Data Mining, 2(2), 121–167.
Mangasarian, O. L., & Musicant, D. R. (2000). Lagrangian support vector machines. Journal of Machine Learning Research, 1, 161–177.
Nocedal, J., & Wright, S. J. (1999). Numerical optimization. New York: Springer.
Pérez-Cruz, F., Alarcón-Diana, P. L., Navia-Vázquez, A., & Artés-Rodríguez, A. (2000). Fast training of support vector classifiers. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Pérez-Cruz, F., Navia-Vázquez, A., Alarcón-Diana, P. L., & Artés-Rodríguez, A. (2000). An IRWLS procedure for SVR. In Proceedings of the EUSIPCO'00. Tampere, Finland.
Pérez-Cruz, F., Navia-Vázquez, A., Alarcón-Diana, P. L., & Artés-Rodríguez, A. (2001). SVC-based equalizer for burst TDMA transmissions. Signal Processing, 81(8), 1681–1693.
Pérez-Cruz, F., Navia-Vázquez, A., Rojo-Álvarez, J. L., & Artés-Rodríguez, A. (1999). A new training algorithm for support vector machines. In Proceedings of the Fifth Bayona Workshop on Emerging Technologies in Telecommunications (pp. 116–120). Baiona, Spain.
Schölkopf, B., & Smola, A. (2001). Learning with kernels. Cambridge, MA: MIT Press.
Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., & Vandewalle, J. (2003). Least squares support vector machines. Singapore: World Scientific.
Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.
Van Gestel, T., Suykens, J. A. K., Baesens, B., Vanthienen, S., Dedene, G., De Moor, B., & Vandewalle, J. (2004). Benchmarking least squares support vector machines classifiers. Machine Learning, 54(1), 5–32.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received January 7, 2004; accepted May 25, 2004.
LETTER
Communicated by Bruno Olshausen
Efficient Coding of Time-Relative Structure Using Spikes

Evan Smith
[email protected]
Department of Psychology, Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.

Michael S. Lewicki
[email protected]
Department of Computer Science, Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.
Nonstationary acoustic features provide essential cues for many auditory tasks, including sound localization, auditory stream analysis, and speech recognition. These features can best be characterized relative to a precise point in time, such as the onset of a sound or the beginning of a harmonic periodicity. Extracting these types of features is a difficult problem. Part of the difficulty is that with standard block-based signal analysis methods, the representation is sensitive to the arbitrary alignment of the blocks with respect to the signal. Convolutional techniques such as shift-invariant transformations can reduce this sensitivity, but these do not yield a code that is efficient, that is, one that forms a nonredundant representation of the underlying structure. Here, we develop a non-block-based method for signal representation that is both time relative and efficient. Signals are represented using a linear superposition of time-shiftable kernel functions, each with an associated magnitude and temporal position. Signal decomposition in this method is a nonlinear process that consists of optimizing the kernel function scaling coefficients and temporal positions to form an efficient, shift-invariant representation. We demonstrate the properties of this representation for the purpose of characterizing structure in various types of nonstationary acoustic signals. The computational problem investigated here has direct relevance to the neural coding at the auditory nerve and the more general issue of how to encode complex, time-varying signals with a population of spiking neurons.

1 Introduction

Nonstationary and time-relative acoustic structures such as transients, timing relations among acoustic events, and harmonic periodicities provide essential cues for many types of auditory processing. In sound localization,
human subjects can reliably detect interaural time differences as small as 10 µs, which corresponds to a binaural sound source shift of about 1 degree (Blauert, 1997). In comparison, the sampling interval for an audio CD sampled at 44.1 kHz is 22.7 microseconds. Auditory grouping cues, such as common onset and offset, harmonic comodulation, and sound source location, all rely on accurate representation of timing and periodicity (Slaney & Lyon, 1993). Time-relative structure is also crucial for the recognition of consonants and many types of transient, nonstationary sounds. Neurophysiological research in the auditory brainstem of mammals has found cells capable of conveying precise phase information up to 4 kHz or of tracking the quickly varying envelope of a high-frequency sound (Oertel, 1999).

The importance of these acoustic cues has long been recognized, but extracting them from natural signals still poses many challenges because the problem is fundamentally ill posed. In natural acoustic environments, with multiple sound sources and background noises, acoustic events are not directly observable and must be inferred using numerous ambiguous cues. Another reason for the difficulty in obtaining these cues is that most approaches to signal representation are block based; the signal is processed piecewise in a series of discrete blocks. Transients and nonstationary periodicities in the signal can be temporally smeared across blocks. Large changes in the representation of an acoustic event can occur depending on the arbitrary alignment of the processing blocks with events in the signal. Signal analysis techniques such as windowing or the choice of the transform can reduce these effects, but it would be preferable if the representation was insensitive to signal shifts.

Shift invariance alone, however, is not a sufficient constraint on designing a general sound processing algorithm. Another important constraint is coding efficiency or, equivalently, the ability of the representation to capture underlying structure in the signal. A desirable code should reduce the information rate from the raw signal so that the underlying structures are more directly observable. Signal processing algorithms can be viewed as a method for progressively reducing the information rate until one is left with only the information of interest. We can make a distinction between the observable information rate, or the rate of the observable variables, and the intrinsic information rate, or the rate of the underlying structure of interest. In speech, the observable information rate of the waveform samples is about 50,000 bits per second, but the intrinsic rate of the underlying words is only around 200 bits per second (Rabiner & Levinson, 1981). Information reduction can be achieved by either selecting only the desired information (and discarding everything else) or removing redundancy, such as the temporal correlations between samples. This reduces the observable information rate while preserving the intrinsic information.

In this letter, we investigate algorithms for fitting an efficient, shift-invariant representation to natural sound signals. The outline of the letter is as follows. The next section describes the motivations behind this approach
and illustrates some of the shortcomings of current methods. After defining the model for signal representation, we present different algorithms for signal decomposition and contrast their complexity. Next, we illustrate the properties of the representation on various types of speech sounds. We then present a measure of coding efficiency and compare these algorithms to traditional methods for signal representation. Finally, we discuss the relevance of the computational issues discussed here to spike coding and signal representation at the auditory nerve.

2 Representing Nonstationary Acoustic Structure

Encoding the acoustic signal is the first step in any algorithm for performing an auditory task. There are numerous approaches to this problem, which differ in both their computational complexity and in what aspects of signal structure are extracted. Ultimately, the choice about what the representation encodes depends on the tasks that need to be performed. In the ideal case, the encoding process extracts only that information necessary to perform the task and suppresses noise or unrelated information. A generalist approach, like that taken by most mammalian auditory systems, requires a representation that is efficient for a wide range of signals. As natural sounds contain both relatively stationary harmonic structure (e.g., animal vocalizations) as well as nonstationary transient structure (e.g., crunching leaves and twigs), this generalist approach requires a code capable of efficiently representing these disparate sound classes (Lewicki, 2002a). Here we seek an auditory representation that is useful for a variety of different tasks.

2.1 Block-Based Representations. Most approaches to signal representation are block based, in which signal processing takes place on a series of overlapping, discrete blocks. This not only obscures transients and periodicities in the signal, but can also have the effect that for nonstationary signals, small time shifts can produce large changes in the representation, depending on whether and where a particular acoustic event falls within the block. Figure 1 illustrates the sensitivity of block-based representation to small shifts in speech signals. The upper panel shows a short speech waveform sectioned into blocks using two sequences of Hamming windows (solid and dashed curves). Each window spans approximately 30 msecs (512 samples), and successive blocks (A1, A2, and so on) are shifted by 10 msecs. The B blocks are offset from the A blocks by an amount indicated by the dot-dash vertical lines (∼5 msecs), representing the arbitrary alignment of the signal with respect to the two block sequences. The lower panel shows spectral representations for the three corresponding blocks (solid for the A blocks, dashed for the B blocks). The jagged upper curves show the power spectra for each windowed waveform. The smooth lower curves (offset by −20 dB) show the spectrum of the optimal filter derived by linear predictive coding.
[Figure 1: Block-based analysis of a speech waveform. Upper panel: the signal with the two shifted sequences of Hamming windows (A and B blocks). Lower panel: signal level in dB versus frequency for the corresponding blocks.]
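The block-alignment sensitivity described above is easy to reproduce. The following NumPy sketch (not the authors' code; the signal, sampling rate, and offsets are arbitrary) compares the Hamming-windowed power spectra of two 512-sample blocks whose start times differ by about 5 ms when the block contains a transient.

```python
# Sensitivity of a block-based power spectrum to a small shift of the analysis window.
import numpy as np

fs, blk = 16000, 512
t = np.arange(2 * blk) / fs
signal = np.sin(2 * np.pi * 300 * t)            # a stationary tone
signal[600:620] += 2.0 * np.hanning(20)         # plus a brief transient event

def block_spectrum(x, start):
    frame = x[start:start + blk] * np.hamming(blk)
    return 10 * np.log10(np.abs(np.fft.rfft(frame)) ** 2 + 1e-12)

spec_a = block_spectrum(signal, 350)            # block A
spec_b = block_spectrum(signal, 430)            # block B, shifted by 80 samples (~5 ms)
print(np.max(np.abs(spec_a - spec_b)))          # the spectra differ although the shift is small
```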